Introduction to Open Data Science - Course Project

About the project

The IODS course introduces the basics of data science and touches upon themes of open science and the use of open data. During the course, participants learn basic tools for working with open data in R, and learn the principles of open science.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Dec 12 16:27:45 2022"

Assignment 1

Exercise set 1 is a fairly good crash course in R. It introduces a lot of useful commands that are often needed to tidy data for further analysis. Most of the commands were familiar, but I did not know that one can use pipes when doing analyses. That will be helpful e.g. when doing analyses for a subset of the data. For an introduction-level book I prefer the data analysis with R book (the one with the parrot). Having read a couple of introductory-level books on R and data analysis, I think the differences are minor, and any of the ones I have read would suffice for a beginner.

Did the thing with the token; it took some time, but it works. Who would have thought that university laptops would require remote access to install Git.

Link to GitHub diary: https://esoini.github.io/IODS-project/ Repository: https://github.com/esoini/IODS-project


2: Regression and model validation

Describe the work you have done this week and summarize your learning.

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)

Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it. (0-4 points)

Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model. (0-3 points)

Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots. (0-3 points)

date()
## [1] "Mon Dec 12 16:27:45 2022"

Reading the data

rm(list = ls())

#setwd("/Users/eetusoin/Desktop/IODS")
lrn14<-read.csv("./data/learning2014.csv")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
fig <- ggpairs(lrn14, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
fig

# naming the object 'fit' instead of 'lm' avoids masking the base lm() function
fit <- lm(points ~ attitude + stra + surf, data = lrn14)
summary(fit)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
lm1 <- lm(points ~ attitude + stra, data = lrn14)
summary(lm1)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
lm2 <- lm(points ~ attitude, data = lrn14)
summary(lm2)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

##1 The data come from the ASSIST survey (Approaches and Study Skills Inventory for Students), and the measures were originally reported in Finnish. The data have 166 observations and 7 variables: gender (male/female), age, exam points, and mean scores on the attitude, deep, strategic (stra) and surface (surf) scales.

##2 The distributions of attitude, deep, stra, surf and points look fairly normal. The distribution of age is positively skewed (longer tail on the right), as most of the students taking the survey have been young adults, and the plot shows some outliers. The strongest correlation is between attitude and points (r = .4), and the next strongest associations are with stra (r = .146) and surf (r = -.144).

##3 Of the three variables used to predict points (attitude, stra and surf, which had the strongest correlations), only attitude was associated with points (statistically significant at p < .001). The model was fitted again without the surf variable. As stra was still not statistically significant at the alpha level of .05, the final model had only attitude as a predictor.

lm() fits a regression line to the data by least squares, i.e. by minimizing the sum of squared residuals. The intercept reflects the predicted points when all the predictors are set to zero. Model parameters were tested using t-tests. The t-test tests whether a coefficient differs from 0, i.e. whether the predictor affects the slope of the regression line. In the final model, an increase of 1 point in attitude changes the predicted exam points by +3.5 points, with a standard error of 0.57.
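As a quick sanity check, the t value that summary() reports is just the estimate divided by its standard error. A minimal sketch using the rounded coefficients printed above for the attitude-only model:

```r
# t statistic = estimate / standard error
# (rounded values taken from the summary(lm2) output above)
est <- 3.5255
se  <- 0.5674
t_val <- est / se          # ~6.21, matching the reported 6.214 up to rounding
# two-sided p-value against the t distribution with 164 residual df
p_val <- 2 * pt(-abs(t_val), df = 164)
```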

Multiple R-squared refers to the proportion of the overall variation that the model explains; it reflects the goodness of fit of the model. When comparing the models above, removing surf and stra from the analyses led to minimal attenuation of the adjusted R-squared, so those variables did not explain the exam points well. In the last model R-squared was 0.19, so the model with only attitude explained approximately 19 percent of the variation.
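R-squared can also be computed by hand as 1 - RSS/TSS. A small sketch on R's built-in cars data (not the course data):

```r
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
fit <- lm(dist ~ speed, data = cars)
rss <- sum(residuals(fit)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)   # TRUE
```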

par(mfrow = c(1, 3))
plot(lm2, which = c(1, 2, 5))

##4 The residuals vs. fitted values plot shows that most of the residuals are symmetrical around zero, indicating that the model fits the data reasonably well. There seems to be greater variability at higher fitted values, which may be because the variance is not constant, so a transformation could be used to account for that.

In the QQ-plot the observations fit the line, but there are some deviations at the low and high values. This indicates that the association may not be linear; a quadratic model could fit the data better.

The residuals vs. leverage plot shows that some observations are close to Cook's distance, but none exceed it. Hence there is no reason to remove any of the observations.

Overall, normality and constancy of variance seem fine, and there are no observations that would significantly impact the model. Thus we can assume that all the assumptions of linear regression were met.


3: Logistic regression

Of the variables in the data, I chose sex, age, quality of family relationships and current health status. The rationale behind the chosen variables was that, as the sexes differ in their vulnerability to internalizing/externalizing problems, I hypothesize that male sex would be a risk factor for high use of alcohol. Adolescents of different ages may differ in the perceived peer pressure to drink alcohol or to blend into the group; hence I hypothesize that older age (i.e. closer to twenty) may also be a risk factor. Lastly, family relationships and current health status may cause distress in an individual, and alcohol may be used as a coping mechanism to alleviate the distress.

# 'alc' was created earlier in the data wrangling exercise
alc1<-alc[c("age","sex","famrel", "health", "high_use")]

fig <- ggpairs(alc1, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
fig

table(alc1$sex,alc1$high_use)
##    
##     FALSE TRUE
##   F   154   41
##   M   105   70
g2<-ggplot(alc1, aes(x=high_use, y= famrel,col=sex))
g2+ geom_boxplot()

g3<-ggplot(alc1, aes(x=high_use, y=health,col=sex))
g3+ geom_boxplot()

cor(alc1$high_use, alc1$age)
## [1] 0.105543

Contrary to my hypothesis, health does not seem to be associated with high alcohol use, though many of the participants were in good health, which may mask some of the effect (i.e. a ceiling effect). High alcohol use correlated positively with age, which is in line with my hypothesis. Also in line with the earlier hypothesis, men/boys seem to use alcohol more. Family relationships have a high mean and some outliers at the lower end, hence the variable may not be well suited for use as a continuous variable in the analyses (though that is what I intend to do).

#Logistic regression

r1<-glm(high_use~age+sex+health+famrel, data= alc1, family="binomial")
summary(r1)
## 
## Call:
## glm(formula = high_use ~ age + sex + health + famrel, family = "binomial", 
##     data = alc1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4646  -0.8332  -0.6688   1.1524   2.2500  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.17523    1.75142  -2.384  0.01713 *  
## age          0.22881    0.09991   2.290  0.02201 *  
## sexM         0.95044    0.24127   3.939 8.17e-05 ***
## health       0.12605    0.08830   1.427  0.15344    
## famrel      -0.36626    0.12858  -2.849  0.00439 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 422.29  on 365  degrees of freedom
## AIC: 432.29
## 
## Number of Fisher Scoring iterations: 4

Of the chosen variables, male sex and family relationships were statistically significant at the alpha level of .001, and age at the alpha level of .05. Current health status was not associated with high alcohol use. Male sex was associated with a considerable increase (log odds = 0.95) in high alcohol use, and age was associated with a slight increase (log odds = .23). Good family relationships were associated with lower log odds of high alcohol use (log odds = -.37).

# compute odds ratios (OR)
OR <- coef(r1) %>% exp

# compute confidence intervals (CI)
CI<-confint(r1)%>% exp
## Waiting for profiling to be done...
print(cbind(OR, CI))
##                     OR        2.5 %    97.5 %
## (Intercept) 0.01537168 0.0004744108 0.4655281
## age         1.25710707 1.0347423229 1.5328666
## sexM        2.58685790 1.6201078828 4.1787499
## health      1.13434220 0.9560769767 1.3527013
## famrel      0.69332056 0.5371929276 0.8909827

Male sex was associated with 2.6 times higher odds of high alcohol use compared to females, with a 95% confidence interval from 1.6 to 4.2. Family relationships had an odds ratio of 0.7, meaning that a one-unit increase on the family relationship scale decreased the odds of high alcohol use. Lastly, a one-year increase in age increased the odds of high alcohol use by a factor of 1.26, with a 95% CI from 1.03 to 1.53.
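The profile-likelihood CIs printed above can also be approximated from the summary output alone with a Wald interval, exp(b ± 1.96 · SE). A sketch using the rounded sexM coefficient and standard error from the model summary:

```r
# Wald-type 95% CI for the sexM odds ratio
# (coefficient and SE taken from the summary(r1) output above)
b  <- 0.95044
se <- 0.24127
exp(b)                         # ~2.59, the OR reported for sexM
exp(b + c(-1.96, 1.96) * se)   # ~1.61 to 4.15, close to the profile CI 1.62-4.18
```

The small difference to confint() is expected: confint() profiles the likelihood instead of assuming a symmetric normal sampling distribution for the coefficient.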

# predict() the probability of high_use
probabilities <- predict(r1, type = "response")

# add the predicted probabilities to 'alc'
alc1 <- mutate(alc1, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc1 <- mutate(alc1, prediction = ifelse(probabilities > 0.5,T,F))

# see the last ten original classes, predicted probabilities, and class predictions

# tabulate the target variable versus the predictions
table(high_use = alc1$high_use, prediction = alc1$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   246   13
##    TRUE     91   20


# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc1$high_use, prob = alc1$probability)
## [1] 0.2810811

The model is correct in (246+20)/370 = 72% of the cases, compared to random guessing (50/50). The training error is 28%.
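The accuracy and training error follow directly from the confusion matrix above:

```r
# accuracy = correct predictions / all observations
# (cell counts from the table(high_use, prediction) output above)
correct <- 246 + 20
total   <- 246 + 13 + 91 + 20
correct / total       # ~0.719, i.e. ~72% accuracy
1 - correct / total   # ~0.281, the same training error loss_func returned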

Cross validation

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc1, cost = loss_func, glmfit = r1, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3

The current model predicts wrongly in about 30% of the cases in cross-validation. The number is higher than for the training data because the model may be overfitted, exploiting random variation in the training data that does not generalize to other data. Cross-validation uses smaller subsets of the data to test generalizability and overfitting.


4: Clustering and classification

#Loading the data and exploring the dimensions and structure
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston dataset is part of MASS, and it includes housing values in the suburbs of Boston. It has 506 observations and 14 variables, of which 12 are numeric and 2 are integer.

#Visualizing
library(corrplot)
## corrplot 0.92 loaded
library(dplyr)
cors <- cor(Boston) %>% round(2)
corrplot(cors,method = "pie",type= "upper")

summary(cors)
##       crim               zn              indus              chas         
##  Min.   :-0.3900   Min.   :-0.5700   Min.   :-0.7100   Min.   :-0.12000  
##  1st Qu.:-0.2150   1st Qu.:-0.4050   1st Qu.:-0.3825   1st Qu.:-0.04750  
##  Median : 0.3200   Median :-0.2550   Median : 0.3950   Median : 0.02000  
##  Mean   : 0.1786   Mean   :-0.0550   Mean   : 0.1929   Mean   : 0.08143  
##  3rd Qu.: 0.4500   3rd Qu.: 0.2775   3rd Qu.: 0.6300   3rd Qu.: 0.09000  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.00000  
##       nox               rm                age               dis         
##  Min.   :-0.770   Min.   :-0.61000   Min.   :-0.7500   Min.   :-0.7700  
##  1st Qu.:-0.360   1st Qu.:-0.29750   1st Qu.:-0.2625   1st Qu.:-0.5225  
##  Median : 0.305   Median :-0.21500   Median : 0.3050   Median :-0.3050  
##  Mean   : 0.190   Mean   :-0.01286   Mean   : 0.1736   Mean   :-0.1464  
##  3rd Qu.: 0.655   3rd Qu.: 0.19000   3rd Qu.: 0.5775   3rd Qu.: 0.2400  
##  Max.   : 1.000   Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.0000  
##       rad               tax             ptratio            black         
##  Min.   :-0.4900   Min.   :-0.5300   Min.   :-0.5100   Min.   :-0.44000  
##  1st Qu.:-0.2850   1st Qu.:-0.3050   1st Qu.:-0.2175   1st Qu.:-0.37750  
##  Median : 0.4600   Median : 0.4850   Median : 0.2250   Median :-0.22500  
##  Mean   : 0.2371   Mean   : 0.2364   Mean   : 0.1157   Mean   :-0.06071  
##  3rd Qu.: 0.6075   3rd Qu.: 0.6475   3rd Qu.: 0.3775   3rd Qu.: 0.16750  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.00000  
##      lstat              medv         
##  Min.   :-0.7400   Min.   :-0.74000  
##  1st Qu.:-0.4000   1st Qu.:-0.46000  
##  Median : 0.4150   Median :-0.38000  
##  Mean   : 0.1407   Mean   :-0.06857  
##  3rd Qu.: 0.5775   3rd Qu.: 0.31000  
##  Max.   : 1.0000   Max.   : 1.00000
#par(mfrow=c(4, 4))
colnames <- dimnames(Boston)[[2]]
for (i in seq_along(colnames)) {
    hist(Boston[,i], main=colnames[i], probability=TRUE, col="pink", border="white")
}

The distributions of rm and lstat seem fairly normal; age, dis and medv seem skewed. Indus and tax seem to have outliers or otherwise high variability in the observed values, as do black, crim and chas (when comparing the histograms with the printed summary).

The strongest negative correlations are between distance to employment centres (dis) and industry (indus), nitrogen oxide concentration (nox) and age (units built before 1940), all below -.7. Particularly interesting are the weak correlations of chas with every other variable. Of the positive correlations, the strongest are between tax and rad (tax rate and access to radial highways), indus and nox, and age and nox.

#scaling the data
library(dplyr)
bs <- as.data.frame(scale(Boston))
summary(bs)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
#crime as factor using quantiles
q<-quantile(bs$crim)
bs$crime <- cut(bs$crim, breaks = q, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
#removing original crim variable
bs<- dplyr::select(bs, -crim)

Standardizing the data centers each variable around 0 and scales it so that its standard deviation is 1. Note that this does not make the data normally distributed; it only puts the variables on a comparable scale.
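A minimal sketch of what scale() does to a single variable (toy vector, not the Boston data):

```r
# standardization: subtract the mean, divide by the standard deviation
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)   # identical to as.numeric(scale(x))
mean(z)                      # 0
sd(z)                        # 1
```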

#Fitting lda
#copy-pasting the code for training set 
n <- nrow(bs)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- bs[ind,]
# create test set 
test <- bs[-ind,]


#fitting the lda
eka<-lda(crime~., data= train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

eka
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2400990 0.2698020 0.2549505 0.2351485 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       0.81167183 -0.9206005 -0.06938576 -0.8507568  0.4341715 -0.8185735
## med_low  -0.07400687 -0.3254960 -0.05560795 -0.5841229 -0.1286559 -0.3626409
## med_high -0.36901585  0.1616886  0.14813795  0.3802477  0.2090196  0.4458561
## high     -0.48724019  1.0172655 -0.02367011  1.0552843 -0.4741261  0.7995404
##                 dis        rad        tax     ptratio      black       lstat
## low       0.7991229 -0.6906105 -0.7364161 -0.42612800  0.3854350 -0.75305321
## med_low   0.3617409 -0.5446108 -0.4919750 -0.06548531  0.3157669 -0.18713828
## med_high -0.3863683 -0.4176728 -0.3118455 -0.36647466  0.1277994 -0.09906541
## high     -0.8593528  1.6366336  1.5129868  0.77903654 -0.8399980  0.89239127
##                 medv
## low       0.51710336
## med_low   0.01947573
## med_high  0.27448588
## high     -0.68249927
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.10237909  0.596918661 -1.06752211
## indus    0.04311230 -0.331028056  0.48223613
## chas    -0.02622842 -0.002121352  0.02685402
## nox      0.46407532 -0.965253570 -1.19657559
## rm       0.03735499 -0.188860891 -0.28082476
## age      0.23297991 -0.437789978 -0.13455727
## dis     -0.08509867 -0.383114418  0.23296421
## rad      3.33582480  0.818208367 -0.01991887
## tax      0.02407188  0.136833881  0.20432316
## ptratio  0.13498305  0.024654329 -0.28459343
## black   -0.12231523  0.023483289  0.08825579
## lstat    0.14546144 -0.034978551  0.49015104
## medv     0.08077687 -0.376207161 -0.01231084
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9522 0.0357 0.0121
classes <- as.numeric(train$crime)
plot(eka, dimen = 2,col=classes, pch=classes)
lda.arrows(eka, myscale = 2)

Predicting the correct classes

# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
pred <- predict(eka, newdata = test)
table(correct=correct_classes, predicted=pred$class)
##           predicted
## correct    low med_low med_high high
##   low       20       9        1    0
##   med_low    1      12        4    0
##   med_high   1      13        7    2
##   high       0       0        0   32

The LDA predicts the “high” class well, with few to no misclassifications. There seems to be ambiguity between the lower levels of the crime variable.

Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)

data("Boston")
dat1<-scale(Boston)
dat<-as.data.frame(scale(Boston))
#Distances
dist_eu<-dist(dat1)

#setting seed for reproducibility
set.seed(666)
# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dat, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <- kmeans(dat, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)

Two seems to be the optimal number of clusters. Some variables seem to differentiate the groups better, for example tax and rad, but for most variables there is a lot of overlap.

Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises).

set.seed(999)
km2<-kmeans(dat,centers = 4)
targ<-km2$cluster
#variable chas appeared to be constant across groups, thus it was removed
lda2<-lda(targ~crim+zn+age+dis+rad+tax+ptratio+black+lstat+medv+nox+indus+rm, data = dat)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(targ)

# plot the lda results
plot(lda2, dimen = 2,col= classes)
lda.arrows(lda2, myscale = 1)

pairs(dat[6:10],col= targ)

Rad (access to radial highways) seems to affect the model most.

5: Dimensionality reduction techniques

Loading the data, and showing summaries and a graphical overview

library(dplyr)
library(GGally)
dat<-read.csv("./data/human2.csv")
row.names(dat)<-dat$X
dat<-dat[,2:9]
summary(dat)
##     sedu_rat         lab_rat            edu            liexp      
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       gni            mat_mor            abr           repre_parl   
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
ggpairs(dat)

Data has 155 observations and 8 variables.

VARIABLES:
- country = Country
- liexp = Life expectancy at birth
- gni = Gross National Income per capita
- edu = Expected years of schooling
- mat_mor = Maternal mortality ratio
- abr = Adolescent birth rate
- repre_parl = Percentage of female representatives in parliament
- sedu_rat = ratio of females with at least secondary education divided by males with at least secondary education
- lab_rat = proportion of females in the work force divided by the proportion of males in the work force

Most of the variables seem to be normally distributed. abr seems skewed to the right, as does mat_mor, and gni is heavily skewed to the right. Life expectancy seems skewed to the left (its mean is below its median).

Principal component analysis (PCA) on the raw (non-standardized) human data and bi-plot

A<-prcomp(dat)
summary(A)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
s<-summary(A)
pca_pr <- round(1*s$importance[2, ], digits = 5)
pca_pr2<-round(pca_pr*100,digits = 2)
pca_lab<-paste0(names(pca_pr), " (", pca_pr2, "%)")

biplot(A, choices = 1:2, cex = c(0.8, 1), col = c("grey70", "deeppink2"), xlab = pca_lab[1], ylab = pca_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped


PCA on standardized data

dat1<-scale(dat)
dat1<-as.data.frame(dat1)
B<-prcomp(dat1)
summary(B)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
s<-summary(B)
pca_pr <- round(1*s$importance[2, ], digits = 5)
pca_pr2<-round(pca_pr*100,digits = 2)
pca_lab<-paste0(names(pca_pr), " (", pca_pr2, "%)")

biplot(B, choices = 1:2, cex = c(0.8, 1), col = c("grey70", "deeppink2"), xlab = pca_lab[1], ylab = pca_lab[2])

Both the plot and the explained variance changed drastically. PCA seeks to explain variance. Without standardizing, the first component seems to explain almost all of the variance. After standardizing it is obvious that other components also contribute, but only the first two components explain more than 10% of the variance.
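The explained-variance percentages used in the axis labels come from the squared standard deviations of the components. A sketch on R's built-in iris measurements (not the human data):

```r
# proportion of variance per component = sdev^2 / sum(sdev^2)
p <- prcomp(iris[, 1:4], scale. = TRUE)
prop <- p$sdev^2 / sum(p$sdev^2)
round(100 * prop, 2)   # same values as summary(p)$importance[2, ] * 100
```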

In the first plot, only gni (which has by far the largest variance, since it is measured on a much larger scale than the other variables) seems to drive the first component, whereas in the second plot maternal mortality, adolescent birth rate, education, life expectancy and the secondary-education ratio load on the first component, while the labour-force ratio and the representation of women in parliament load on the second component.

Interpretations of the obtained results

It seems that the first component relates to overall welfare and the second component more to the equality of the sexes. When looking at the countries in the plot, most of the Western states are located on the left, whereas less wealthy countries are located on the right. When looking at the second component, countries in which the status of women is not equal to that of men are located at the bottom of the plot. Interestingly, some of the African countries are at the top of the plot. It is worth noting that there may be some interaction between the components chosen by PCA, as in wealthier countries it may be possible for one sex not to work, whereas in poorer countries both sexes may have to work.

The tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions).

Load the tea dataset and convert its character variables to factors:

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)


library(dplyr)
library(tidyr)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
tea1 <- tea[, 1:10]
library(GGally)
ggpairs(tea1)

pivot_longer(tea1, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + geom_bar()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Exploring the data briefly.

library(FactoMineR)

mca <- MCA(tea1, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea1, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.179   0.129   0.115   0.107   0.104   0.090   0.079
## % of var.             17.939  12.871  11.481  10.654  10.406   8.985   7.917
## Cumulative % of var.  17.939  30.810  42.291  52.945  63.351  72.337  80.253
##                        Dim.8   Dim.9  Dim.10
## Variance               0.076   0.068   0.054
## % of var.              7.559   6.793   5.395
## Cumulative % of var.  87.812  94.605 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1             | -0.543  0.549  0.474 | -0.462  0.552  0.342 | -0.007  0.000
## 2             | -0.543  0.549  0.474 | -0.462  0.552  0.342 | -0.007  0.000
## 3             | -0.068  0.009  0.002 |  0.542  0.761  0.141 | -0.139  0.056
## 4             | -1.037  1.998  0.558 |  0.264  0.180  0.036 | -0.018  0.001
## 5             | -0.235  0.103  0.061 | -0.011  0.000  0.000 |  0.124  0.044
## 6             | -1.037  1.998  0.558 |  0.264  0.180  0.036 | -0.018  0.001
## 7             |  0.024  0.001  0.001 | -0.404  0.422  0.374 | -0.185  0.099
## 8             | -0.193  0.069  0.054 |  0.058  0.009  0.005 | -0.382  0.423
## 9             | -0.258  0.124  0.117 | -0.624  1.007  0.680 | -0.130  0.049
## 10            | -0.048  0.004  0.002 | -0.153  0.061  0.020 | -0.176  0.090
##                 cos2  
## 1              0.000 |
## 2              0.000 |
## 3              0.009 |
## 4              0.000 |
## 5              0.017 |
## 6              0.000 |
## 7              0.078 |
## 8              0.210 |
## 9              0.030 |
## 10             0.027 |
## 
## Categories (the 10 first)
##                   Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## breakfast     |   0.209   1.168   0.040   3.471 |  -0.770  22.090   0.547
## Not.breakfast |  -0.193   1.078   0.040  -3.471 |   0.710  20.391   0.547
## Not.tea time  |  -0.680  11.253   0.358 -10.351 |   0.327   3.635   0.083
## tea time      |   0.527   8.723   0.358  10.351 |  -0.254   2.817   0.083
## evening       |   0.445   3.787   0.103   5.562 |   0.634  10.728   0.210
## Not.evening   |  -0.233   1.980   0.103  -5.562 |  -0.332   5.609   0.210
## lunch         |   0.947   7.329   0.154   6.788 |   0.167   0.317   0.005
## Not.lunch     |  -0.163   1.260   0.154  -6.788 |  -0.029   0.054   0.005
## dinner        |  -1.571   9.633   0.186  -7.454 |   1.044   5.925   0.082
## Not.dinner    |   0.118   0.725   0.186   7.454 |  -0.079   0.446   0.082
##                v.test     Dim.3     ctr    cos2  v.test  
## breakfast     -12.786 |  -0.031   0.041   0.001  -0.518 |
## Not.breakfast  12.786 |   0.029   0.038   0.001   0.518 |
## Not.tea time    4.983 |   0.235   2.102   0.043   3.579 |
## tea time       -4.983 |  -0.182   1.630   0.043  -3.579 |
## evening         7.929 |  -0.599  10.742   0.188  -7.494 |
## Not.evening    -7.929 |   0.313   5.616   0.188   7.494 |
## lunch           1.195 |  -1.133  16.410   0.221  -8.125 |
## Not.lunch      -1.195 |   0.195   2.820   0.221   8.125 |
## dinner          4.951 |  -0.090   0.050   0.001  -0.428 |
## Not.dinner     -4.951 |   0.007   0.004   0.001   0.428 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.040 0.547 0.001 |
## tea.time      | 0.358 0.083 0.043 |
## evening       | 0.103 0.210 0.188 |
## lunch         | 0.154 0.005 0.221 |
## dinner        | 0.186 0.082 0.001 |
## always        | 0.089 0.096 0.414 |
## home          | 0.008 0.114 0.005 |
## work          | 0.215 0.006 0.251 |
## tearoom       | 0.315 0.003 0.018 |
## friends       | 0.325 0.141 0.008 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic",habillage = "quali")

According to the plot, having tea at dinner and not drinking tea at home have the greatest impact on the first two dimensions. Variables close to the horizontal x-axis have less effect on dimension 2, and variables closer to the y-axis have less effect on dimension 1. Most of the "Not" categories are close to the origin, and their counterparts are located further away, meaning the latter have more impact on the two dimensions.

Dimension 2 seems to reflect the social aspect of drinking tea: factors related to the social side of tea drinking affect the y-value in the plot. Dimension 1 is harder to interpret, but one possibility is that it reflects a "tea as a hobby" aspect, with things like drinking tea in a tearoom and having tea time.
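This visual reading can be checked numerically: a FactoMineR MCA result stores each category's contribution to each dimension in `$var$contrib` (percentages that sum to 100 per dimension). A sketch, assuming FactoMineR and its bundled tea data are available:

```r
# Rank categories by their contribution (%) to MCA dimension 1.
library(FactoMineR)
data(tea)                # the same questionnaire data used above
res <- MCA(tea[, 1:10], graph = FALSE)
head(sort(res$var$contrib[, 1], decreasing = TRUE))
```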


Loading the data

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.7     ✔ stringr 1.4.0
## ✔ readr   2.1.3     ✔ forcats 0.5.1
## ✔ purrr   0.3.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ MASS::select()  masks dplyr::select()
dat1<-read.csv("./data/BPRS.csv")
dat2<-read.csv("./data/rats.csv")
dat2og<-read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt",sep = "\t",header = T)

#changing to factors
dat1<-dat1 %>% mutate(treatment=as.factor(treatment)) %>%
  mutate(subject= as.factor(subject)) %>%
  mutate(id=as.factor(id))
dat2<-dat2 %>% mutate(ID=as.factor(ID)) %>%
  mutate(Group=as.factor(Group))

#dat1 <-  dat1 %>% mutate(week = as.integer(substr(weeks,5,5)))
#dat2 <-  dat2 %>% mutate(time = as.integer(substr(time,3,3)))

Dat2 contains the rats data: body weights of rats measured repeatedly over time, with the rats divided into three diet groups.

First I will visualize the data

dat2 <- dat2 %>%
    group_by(time1) %>%
  mutate( stdweight = (weight - mean(weight))/sd(weight) ) %>%
  ungroup()

# Plot again with the standardised weight
library(ggplot2)
ggplot(dat2, aes(x = time1, y = stdweight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight")

There seems to be a noticeable difference between group 1 and the other groups, though note that groups 2 and 3 have only 4 and 3 rats, respectively.
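As a side note, the group-wise standardization used above (z-scoring weight within each measurement time) can also be written in base R with `ave`; a small self-contained example with made-up weights:

```r
# Made-up data: standardize weight within each time point.
d <- data.frame(
  time   = rep(1:2, each = 3),
  weight = c(240, 250, 260, 255, 265, 275)
)
d$std <- ave(d$weight, d$time,
             FUN = function(x) (x - mean(x)) / sd(x))
d$std  # -1 0 1 -1 0 1
```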

dat22 <- dat2 %>%
  filter(time1>1) %>%
  group_by(Group, time1) %>%
  summarise( mean = mean(weight), se = sd(weight)/sqrt(16) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(dat22, aes(x = time1, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(weight) +/- se(weight)")

dat22 <- dat2 %>%
  filter(time1>1) %>%
  group_by(Group, ID) %>%
  summarise( mean = mean(weight), se = sd(weight)/sqrt(16) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(dat22, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight), time 8-64")

Time doesn't seem to affect the standardized weight much, and values at baseline predict the later course pretty well.

Let's conduct an ANOVA to test the differences, though according to the plots groups 2 and 3 differ from group 1.

dat3 <- dat2 %>%
  filter(time1 > 1) %>%
  group_by(Group,ID) %>%
  summarise( mean=mean(weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
dat3 <- dat3 %>%
  mutate(baseline = dat2og$WD1)

# Fit the linear model with the mean as the response 
fit <- lm(mean~baseline+Group, data = dat3)

# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The group effect is close to statistical significance. Baseline seems to predict the mean well, and after adjusting for baseline the groups did not differ significantly from each other.

I will use a contrast to check if group 1 differs from the other groups.

dat4<-dat3 %>% mutate(group1=ifelse(Group==1,1,2)) 
fit2 <- lm(mean~baseline+group1, data = dat4)

# Compute the analysis of variance table for the fitted model with anova()
anova(fit2)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## baseline   1 253625  253625 1806.5913 2.448e-15 ***
## group1     1    690     690    4.9159   0.04505 *  
## Residuals 13   1825     140                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we combine groups 2 and 3, they seem to differ from group 1 at the .05 alpha level. As the analysis was done post hoc, it should not be interpreted as hypothesis testing, but rather as exploring the data.

## Task 2

Dat1 contains data on 40 male subjects who were randomly assigned to one of two treatment groups. Each subject was rated on the brief psychiatric rating scale (BPRS) before treatment (week 0), and follow-up data was collected weekly for 8 weeks. The BPRS is used to evaluate patients suspected of having schizophrenia.

Standardizing the data and drawing plots

dat1 <- dat1 %>%
    group_by(week) %>%
  mutate( stdbprs = (bprs - mean(bprs))/sd(bprs) ) %>%
  ungroup()

# Plot again with the standardised bprs
library(ggplot2)
ggplot(dat1, aes(x = week, y = stdbprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_y_continuous(name = "standardized bprs")

dat11<-dat1%>% arrange(week)

ggplot(dat11, aes(x = week, y = stdbprs, group = id)) +
  geom_line(aes(linetype=treatment))+
  scale_y_continuous(name = "standardized bprs")+
  theme(legend.position = "top")

Standardized bprs scores do not seem to differ greatly between the groups.

## Regression

Next I fit regressions to the data in a stepwise manner. First, a regular linear regression that ignores the fact that the data contain multiple observations from each participant, i.e. it wrongly assumes that the observations are independent of each other rather than clustered within participants. Next, I fit a model with a random intercept to account for the correlation within individuals, so that the intercept is allowed to vary between participants. Lastly in this section I fit a random intercept and random slope model, in which, in addition to the intercept, the slope is also allowed to vary by participant.

# create a regular linear regression model
reg <- lm(bprs~ week+treatment, data = dat1)

# print out a summary of the model
summary(reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = dat1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# Create a random intercept model
ref <- lmer(bprs ~ week + treatment + (1 | subject), data = dat1, REML = FALSE)

# Print the summary of the model
summary(ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: dat1
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000
ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = dat1, REML = FALSE)

# print a summary of the model
summary(ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: dat1
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
# perform an ANOVA test on the two models
anova(ref1, ref)
## Data: dat1
## Models:
## ref: bprs ~ week + treatment + (1 | subject)
## ref1: bprs ~ week + treatment + (week | subject)
##      npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## ref     5 2748.7 2768.1 -1369.4   2738.7                       
## ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The random intercept reduces the error of the model compared to the regular regression. According to the ANOVA, the model with both random intercept and random slope fits the data better at the .05 alpha level, meaning that ref1 explains more of the variation in the data than ref.
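The likelihood-ratio test that `anova()` performs here can be reproduced by hand from the deviances reported above: the test statistic is the drop in deviance, compared against a chi-squared distribution with 2 degrees of freedom (the extra random-slope variance and its correlation):

```r
# Drop in deviance between ref (2738.7) and ref1 (2731.4), on df = 2.
round(pchisq(2738.7 - 2731.4, df = 2, lower.tail = FALSE), 3)  # ≈ 0.026
```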

Lastly, I will fit a model that allows an interaction between week and treatment.

ref2 <- lmer(bprs ~ week + treatment + (week | subject) + (week*treatment), data = dat1, REML = FALSE)
summary(ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + (week * treatment)
##    Data: dat1
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0513 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 65.0048  8.0626        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4702  9.8219        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2522  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
anova(ref2,ref1)
## Data: dat1
## Models:
## ref1: bprs ~ week + treatment + (week | subject)
## ref2: bprs ~ week + treatment + (week | subject) + (week * treatment)
##      npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fitted <- fitted(ref1)
dat1$fitted<-Fitted

# draw the plot of dat1 with the fitted values of bprs

ggplot(dat1, aes(x = week, y = fitted, group = id)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "week", breaks = seq(0, 10, 2)) +
  scale_y_continuous(name = "Fitted bprs") +
  theme(legend.position = "top")

The model that allows an interaction between week and treatment does not improve the fit at the .05 alpha level. Looking at the results from ref1 (the model with random intercepts and random slopes), the two groups do not seem to differ from each other. We don't know whether the other group was a control group or whether two different treatments were compared, so all we can say is that the treatments do not differ in their effect on bprs. All the participants showed a roughly linear trend, with later weeks associated with lower bprs scores.
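To back up the "no group difference" reading, a rough 95% Wald interval for the treatment2 coefficient in ref1 can be computed from the estimate and standard error reported in its summary:

```r
# Fixed-effect estimate and SE for treatment2 from summary(ref1).
est <- 0.5722
se  <- 1.0405
round(est + c(-1.96, 1.96) * se, 2)  # ≈ -1.47 to 2.61, clearly crossing zero
```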